Overview and Motivation

Provide an overview of the project goals and the motivation for it. Consider that this will be read by people who did not see your project proposal. In our project, we use the SEDA dataset that maps student achievement based on different district level covariates. The Educational Opportunity Project at Stanford University measures educational opportunity in diverse communities all over America. We loaded the file from the publicly available datasets that the SEDA project has online here We use the inner join function to join the achievement scores dataset and the covariates dataset.

Initial Questions

What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?

Part 1: Predicting District Academic Achievement

In the first part of our project, we aim to understand the variables that predict academic achievement on a district level. This section of our project has three components. First, we will use explortory data analysis and machine learning modeling to determine key predictors of academic achievment on which to focus. Next we will take a closer look at some socio-economic status, unemployment, and poverty as predictors of academic achievement, using Washington and Massachusetts as case studies. Finally, we will explore how receiving free lunch at school predicts academic success.

Data Source:

Data for this portion of the project is obtained from the Stanford

this still needs to be done– describing the data source

# Load achievement data (district, CS, pooled) 
# data obtained from: "https://stacks.stanford.edu/file/druid:db586ns4974/seda_geodist_pool_cs_4.0.dta"
# This data set contains data on the outcome of interest, district academic achievement.
# In this data set, academic achievement is measured in terms of standard deviation 
# units and is what we will be using for our regression analyses
ach <- read_dta("seda_geodist_pool_cs_4.0.dta")  

# Load main covariates data (district level), pool 
# data obtained from https://stacks.stanford.edu/file/druid:db586ns4974/district%20covariates.dta"
# This data set contains an extensive list of covariates that will be tested as
# predictors of our outcome, academic achievement
covs <- read_dta("seda_cov_geodist_pool_4.0.dta")


# Merge files together. This keeps only matched districts.
dat <- inner_join(ach, covs, by = c("sedalea", "fips"))


# Subset to the "all" subgroup estimates for all students, only districts
# with an estimate of average achievement, unemployment, poverty, ses avergage, and non-missing SES.
dat <- filter(dat,
              subgroup == "all",
              !is.na(cs_mn_avg_ol),
              !is.na(unempavgall),
              !is.na(povertyavgall),
              !is.na(sesavgall))
nrow(dat)
## [1] 12438

After creating this data set, we noticed that the covariate data set that we used did not include variables related to district spending. This is an area of interest of our team, as it shows how policies and funding can play a round in academic outcomes. Therefore, obtained an additional SEDA dataset with district spending included and merged this with the data set above.

#importing the data set that includes variables of interest: ppexp_tot, ppexp_inst, pprev_tot
spending <- read_dta("district covariates.dta")  

# Subsetting data set to only include variables of interest
# also including the district ID, titled leaid here, and state ID (fips) for joining
spending <- subset(spending, select = c(ppexp_tot, ppexp_inst, pprev_tot, leaid, fips))

# since the name of the district ID variable in the dataset "spending" (leaid) is different
# from the district ID variable in the data set "dat" (sedalea), we need to change this

spending <- rename(spending, sedalea = leaid)

#checking the class of `sedalea` in "spending" df 
class(spending$sedalea)
## [1] "character"
#checking the class of `sedalea` in "dat" df

class(dat$sedalea)
## [1] "numeric"
#converting `sedalea` in "spending" df to numeric class so that they are both of the same class
spending$sedalea <- as.numeric(spending$sedalea)

#confirming that `sedalea` in "spending" is now numeric
class(spending$sedalea)
## [1] "numeric"
# Merge files together. This keeps only matched districts.
dat <- inner_join(dat, spending, by = c("sedalea", "fips"))

Part 1a: Deciding the predictors of interest

Explorotory Analysis

For the exploratory data analysis, we start by summarizing our outcome of interest: district academic achievement. As mentioned in the code above, the data set that we imported for our measures academic achievement in terms of standard deviation units away form the national mean. While we can compare relative performance using this measure, interpreting standard deviation units for exploratory data analysis is not too useful, and can be difficult to interpret. Therefore, we will import another SEDA dataset that measures academic achievement outcomes in terms of “grade levels” using the variable gcs_mn_avg_ol.

As with cs_mn_avg_ol, this variable indicates the average math and reading scores for grades 3 through 9 from academic years 2008-2009 to 2015-2016. This scale, however, is in “grade levles” where 5.5 is average across all districts (mid point of grades 3 and 8, the grades that participate in testing). Using this scale means that if a district has an average grade level of 7.5, their students math and reading scores are, on average, similar to students who are one grade level higher, when compared to an average district in the nation.

# loading SEDA data that includes academic achievement in "grade levels"

ach.EDA <- read_dta("seda_geodist_pool_gcs_4.0.dta")  

ach.EDA <- filter(ach.EDA,
              subgroup == "all",
              !is.na(gcs_mn_avg_ol))

To begin our exploratory data analysis on our outcome, it would be helpful to first take a look at how much variation we have in our data. We can visualize this first by using a histogram.

ggplot(aes(x=gcs_mn_avg_ol), data = ach.EDA) +
  geom_histogram(color="black", fill="#99CCFF") +
  xlab("Average Test Scores (Grade Levels)") +
  ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histogram above, we see that there is decent amount of variation in the district average achievement scores, as measured in “Grade Levels.” For example, some districts have average test scores of around 2.5– that means that district is testing 3 grade levels below the national average of 5.5! On the other hand, there are also some very high achieving districts. For example, some districts have average test scores at around 7.5 grade levels, 2.5 grade levels above the national average.

Our team is hoping to understand some of the district-level factors that leads to such variation in academic performance in the United States.

Before moving onto our analysis, we can also look at the variation in test scores between and within states using box plots:

ach.EDA %>% mutate(stateabb = reorder(stateabb, gcs_mn_avg_ol, FUN = median)) %>% 
  ggplot(aes(stateabb, gcs_mn_avg_ol, group = stateabb)) + 
  geom_boxplot(fill = "#99CCFF") +
  labs(x = "State Abbreviation", y = "Average District Test Scores (Grade Levels)") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

From the graph above, we notice there is a lot of variation in test scores both between and within states. Districts in Massachusetts have the highest median test score, performing at about 1.5 grade levels above the national average while New Mexico has the lowest median achievement scores with a median of around 0.5 grade levels below the national average. With that being said, within most states there is a lot of variation in performance. Within a given state, take California for example, you can find districts with both extremely high and extremely low test scores.

In our analysis, we hope to understand the factors that contribute to such variation between districts across the United States. The covariate data set has many different variables, many of which may predict district test scores. However, for the purpose of our project, we want to focus in on a few key covariates in order to produce more focused and meaningful conclusions.

To decide our priority predictors, we can start of by visually summarizing the overall relationships between these variables and our outcome for academic achievement, cs_mn_avg_ol. In contrast our achievement variable used in the plots above, cs_mn_avg_ol measures academic achievement in terms of standard deviations away from the national average.

Before we begin, it would be helpful to summarize create a list of our predictors for this analysis. The covariates we are exploring include:

  • urban: if the district is in a city/urban locale (binary; 0 = no, 1 = yes)
  • avgrdall: average per grade enrollment (number of students)
  • perfrl: percent free lunch in the grade (percentage; range: 0-1)
  • perell: % of all students in the district that are English Language Learners (ELL ) (percentage; range: 0-1)
  • perspeced: % of all students in the district that qualify for Special Education (percentage; range: 0-1)
  • nsch: number of schools in district (count)
  • aides: number of instructional aides (count)
  • corsup: number of instructional coordinators and supervisors (count)
  • elmgui: number of Elementary Guidance Counselors (count)
  • stutch_all: pupil teacher ratio (ratio of number of pupils to number of students in district)
  • ppexp_tot: total per pupil expenditures (dollars)
  • ppexp_inst: total per pupil expenditures on instruction (dollars)
  • pprev_tot : revenue per pupil (dollars)
  • baplus_all: % of adults in district with at least a bachelors degree (percentage)
  • poverty517_all: % of households with 5-17 year olds in poverty (percentage)
  • singmom_all % of household with children with single mom (percentage)
  • snap_all : % of households receiving snap benefits (percentage)
  • samehouse_all: % of households living in the same house as last year (percentage)
  • unemp_all : % unemployed in the district (percentage)
  • inf_fem: percent of 25 - 64 year old females in labor force (percentage)
  • teenbirth_all: percent of 15-19 year olds giving birth in district (percentage)
  • sesall: standardized socio-economic status composite score computed as the first principal component factor score of the following measures: median income, percent with a bachelor’s degree or higher, poverty rate, SNAP rate, single mother headed household rate, and unemployment rate
  • cs_mn_avg_ol: District test-based achievement math and reading scores from ordinary least quares estimate

NOTE TO OTHERS– IM STILL WORKING ON THIS SECTION– I CHANGED MY LIST OF PREDICTORS TO JUST INCLUDE THE ONES FROM THE “COVS” DATA SET PLUS THE EXPENDITURE ONES SO THE ABOVE LIST AND THE BELOW GRAPHS ARE NOT APPLICABLE. I WILL FIX THIS AND UPDATE WHENEVER THERES A GAP IN PEOPLE WORKING ON THE .RMD PLEASE CARRY ON :)

Oh and i will also make the scatter plots below look better dw

For the continuous predictors, we can do scatter plots:

ML.dta <- read.csv("complete_data_ML.csv")

ML.dta %>% 
  gather(predictor, value, c(avgrdall, perfrl, perell, perspeced, nsch, aides,
                             corsup, elmgui, stutch_all, ppexp_tot, ppexp_inst, 
                             pprev_tot, baplus_all, poverty517_all, singmom_all, 
                             snap_all, samehouse_all, unemp_all, inlf_fem, 
                             teenbirth_all, sesall)) %>% 
  ggplot(aes(x = value, y = cs_mn_avg_ol)) + 
  geom_point() + 
  facet_wrap(~ predictor, scales = 'free_x') + 
  xlab(NULL) + ylab("District Achievement Scores")

For the binary predictor (urban), we can use a box plot:

Our above scatter plots confirm what we initially suspected: many of these variables do in fact predict academic achievement. For example, it appears baplus_all, the percent of adults in the district with a bachelor’s degree, is positively correlated with academic achievement. On the other hand, the percent of households in poverty or with a single mother appear to be negatively correlated. For other variables, such as per pupil expenditure (ppexpt_inst), the association is less clear. Overall, while these scatterplots give us a general sense of what variables may be important, it still is not clear which variables matter the most for academic achievement.

To answer this question, we will turn to machine learning models. We will first use random forest to determine variables of importance in predicting academic achievement, as measured by the Gini coeffient. We will then addition use a regression tree to see which variables are included to predict relative test scores.

Using Machine Learning to Determine Predictors of Importance

Creating the Training and Test Sets

To begin our machine learning exercises, we first must great our training and test sets. Here we are using training and test sets that both include 50% of the dataset.

#creating a data partition that splits my data frame in half
index_train<- createDataPartition(y = ML.dta$low_dist_ach, times =1, p=0.05, list = FALSE)

#labeling the first "slice" as the train set
train_set <- slice(ML.dta, index_train)

#labelign the second "slice" as the test set
test_set <- slice(ML.dta, -index_train)

Random Forest

Next, we will use a Random Forest Model to determine variables of importance, as measured using the Gini Coefficient.

set.seed(1)

#fitting a random forest with all 22 predictors
rf_fit <- randomForest(cs_mn_avg_ol ~ urban + avgrdall + perfrl + perell + perspeced + nsch + aides + corsup + elmgui + stutch_all + ppexp_tot + ppexp_inst + pprev_tot + baplus_all + poverty517_all + singmom_all + snap_all + samehouse_all + unemp_all + inlf_fem + teenbirth_all + sesall, data = train_set, mtry = 22, importance = TRUE)


variable_importance <- importance(rf_fit)

variable_importance_death <- tibble(Predictor = rownames(variable_importance),
                  Gini = variable_importance[,1]) %>%
                  arrange(desc(Gini))

kable(variable_importance_death[1:22,])
Predictor Gini
perfrl 91.5406806
baplus_all 43.7560402
poverty517_all 14.5340365
sesall 14.1672958
inlf_fem 11.7686118
aides 10.2136244
ppexp_inst 9.9599739
ppexp_tot 9.5520153
avgrdall 8.3741901
perell 7.3540926
samehouse_all 7.2292966
pprev_tot 7.1302685
stutch_all 5.8884985
elmgui 5.5631651
singmom_all 5.4166465
corsup 5.3096917
nsch 5.1142034
snap_all 4.6774314
unemp_all 4.4121253
perspeced 2.5750359
teenbirth_all 0.1718682
urban -1.1544165
#predicting probability of estimates for test set
probs_rf <- predict(rf_fit, newdata = test_set, type = "class")

#converting probabilities to predicted response labels

y_hat_rf <- ifelse(probs_rf > 0.5, 1, 0)

#obtaining the confusion matrix

confusionMatrix(data = as.factor(y_hat_rf), 
                reference = as.factor(test_set$low_dist_ach), 
               positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9264 1705
##          1  777    3
##                                           
##                Accuracy : 0.7887          
##                  95% CI : (0.7813, 0.7961)
##     No Information Rate : 0.8546          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : -0.0976         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0017564       
##             Specificity : 0.9226173       
##          Pos Pred Value : 0.0038462       
##          Neg Pred Value : 0.8445619       
##              Prevalence : 0.1453741       
##          Detection Rate : 0.0002553       
##    Detection Prevalence : 0.0663886       
##       Balanced Accuracy : 0.4621869       
##                                           
##        'Positive' Class : 1               
## 

From the above outputs we can see two things. First, our random forest model is fairly accurate at predicting test scores with an accuracy of 80.47%. Second, as measured by the Gini coefficient, percent free lunch is the most important predictor of test scores followed by percent of adults who have their bachelors degree. Other important indicators include socio-economic status and poverty.

Regression Tree

To supplement the above results from the random forest, we can also fit our data to a regression tree model.

tree = tree(cs_mn_avg_ol ~ urban + avgrdall + perfrl + perell + perspeced + 
              nsch + aides + corsup + elmgui + stutch_all + ppexp_tot + 
              ppexp_inst + pprev_tot + baplus_all + poverty517_all + 
              singmom_all + snap_all + samehouse_all + unemp_all + 
              inlf_fem + teenbirth_all + sesall, data = train_set)  

summary(tree)
## 
## Regression tree:
## tree(formula = cs_mn_avg_ol ~ urban + avgrdall + perfrl + perell + 
##     perspeced + nsch + aides + corsup + elmgui + stutch_all + 
##     ppexp_tot + ppexp_inst + pprev_tot + baplus_all + poverty517_all + 
##     singmom_all + snap_all + samehouse_all + unemp_all + inlf_fem + 
##     teenbirth_all + sesall, data = train_set)
## Variables actually used in tree construction:
## [1] "perfrl"     "baplus_all" "nsch"       "sesall"    
## Number of terminal nodes:  8 
## Residual mean deviance:  0.02941 = 17.97 / 611 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.641800 -0.106100  0.009115  0.000000  0.114400  0.686400

As seen in teh above output, the regression tree only ened up including three variables in the tree construction: percent of students receiving free lunch (perfrl), percent of adults with at least a bachelors degree (baplus_all), and percent of students who are english language learners (perell).

plot(tree, type = "uniform")
text(tree, cex = 1)

Need to add tree conclusions

So based on these analyses, what predictors should we focus on? It is clear from the random forest and the regression tree models that some variables are more important for academic success than others. For example, percent of students receiving free lunch has high priority in both the random forest model and the regression tree models so warrants attention in future analyses. We additionally want to balance the outcomes of this exploratory data analysis with the interests of our team members. For example, socio-economic status and related predictors, such as poverty and unemployment, are often described as strong predictors of academic success (source?). Therefore, it would also be worthwhile to explore these variable further and the nature of their associations with academic achievement outcomes.

Therefore, we explore different covariates of interest that highly correlate with average district academic achievement, relative to the national mean measured in standard deviation units, including unemployment, free lunch, SES, and poverty. In this section we look at poverty, unemployment, and SES composite scores.

Summary statistics and correlations

To reiterate what we was mentioned above, we use a stargazer to observe the mean and standard deviations of each variable and then review the correlation between each of the unemployment, SES scores, poverty, and average academic achievement variables. We see that SES, poverty, and unemployment are all highly correlated with one another as expected. Furthermore, we observe that all three variables are correlated with average academic achievement with SES and average test scores having a correlation of 0.76, unemployment and average academic achievement have a negative correlation of -0.57 and poverty and average academic achievement also having a negative correlation of -0.67.

dat2 <- dat %>%
  group_by(stateabb) %>%
  summarise(ses = mean(sesavgall), poverty = mean(povertyavgall), unemployment = mean(unempavgall))


stargazer(data.frame(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)),
          type="text")
## 
## ==================================================================
## Statistic       N    Mean  St. Dev.  Min   Pctl(25) Pctl(75)  Max 
## ------------------------------------------------------------------
## cs_mn_avg_ol  12,366 0.021  0.334   -2.341  -0.185   0.222   1.248
## unempavgall   12,366 0.060  0.020   0.002   0.047    0.071   0.221
## povertyavgall 12,366 0.126  0.062   0.007   0.081    0.159   0.460
## sesavgall     12,366 0.333  0.846   -4.401  -0.157   0.872   2.910
## ------------------------------------------------------------------
#correlation between outcome and unemployment, poverty, and ses
round(cor(dplyr::select(dat, cs_mn_avg_ol, unempavgall, povertyavgall, sesavgall)), 2)
##               cs_mn_avg_ol unempavgall povertyavgall sesavgall
## cs_mn_avg_ol          1.00       -0.57         -0.67      0.76
## unempavgall          -0.57        1.00          0.65     -0.75
## povertyavgall        -0.67        0.65          1.00     -0.92
## sesavgall             0.76       -0.75         -0.92      1.00

Variation in each variable

After seeing the descriptive statistics of each variable, we plot histograms of unemployment, SES scores, poverty, and average academic achievement to display the distribution of each variable. SES is measured as “sesavgall” which is the SES composite score for all families in the district between 2005-2009 and 2014. “povertyavgall” represents the poverty rate (eb estimate) of all families in the district between 2005-2009 and 2014 and the “unempavgall” represents the unemployment rate (eb estimate) of all families in the district between 2005-2009 and 2014. We create histogram and annotate them with the means and standard deviations of each variable. We see that poverty, academic achievement, and SES all have wider spreads of variation and the distribution of unemployment seems to be more narrow.

# How much variation is there in average academic achievement?
ggplot(aes(x=cs_mn_avg_ol), data = dat) +
  geom_histogram(color="black", fill="cyan2", binwidth = 0.05) +
  annotate("text", x=-1.8, y=750, label = "Mean=0.021, SD=0.33",
           size=5, hjust=0, vjust=0) +
  xlab("Average Test Scores (Grade 5.5, CS Scale)") +
  ylab("Frequency") +
  theme_classic()

# How much variation is there in unemployment?
ggplot(aes(x=unempavgall), data = dat) +
  geom_histogram(color="black", fill="lightsalmon1", binwidth = 0.005) +
  annotate("text", x=-0.15, y=1010, label = "Mean=0.06 , SD=0.02",
           size=5, hjust=0, vjust=0) +
  xlab("Average Unemployment") +
  ylab("Frequency") +
  theme_classic()

# How much variation is there in poverty?
ggplot(aes(x=povertyavgall), data = dat) +
  geom_histogram(color="black", fill="darkolivegreen3", binwidth = 0.01) +
  annotate("text", x=0.21, y=600, label = "Mean = 0.126, SD=0.062",
           size=5, hjust=0, vjust=0) +
  xlab("Average Poverty") +
  ylab("Frequency") +
  theme_classic()

# How much variation is there in SES?
ggplot(aes(x=sesavgall), data = dat) +
  geom_histogram(color="black", fill="blueviolet", binwidth = 0.1) +
  annotate("text", x=-3.5, y=500, label = "Mean = 0.33, SD=0.85",
           size=5, hjust=0, vjust=0) +
  xlab("Average SES") +
  ylab("Frequency") +
  theme_classic()

Plots of Association

Now we want to utilize ggplot to plot the association between the three socioeconomic factors including unemployment, SES scores, and poverty with average academic achievement. We also obtain a pearson correlation coefficient (r) for each graph and annotate it on the plot. Each point represent a district and the size of each point is the total enrollment between grades 3-8 in that district. Lastly, we draw a non-linear line that follows the trajectory of our data points. In the first plot, we see that there is a negative association between unemployment and average district academic achievement. Our r= -0.57. Similarly in the next plot, we also observe that there is a negative association between poverty and average district academic achievement with an r = -0.67. In the last plot, we see that there is actually a positive association between SES and average district academic achievement with the highest correlation coefficient of r = 0.76.

# What is the association between unemployment and average academic achievement?
r1 <- round(cor(dat$cs_mn_avg_ol, dat$unempavgall),2)
ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.3, pch=21, color = "black", fill = "lightsalmon1", aes(size = totenrl)) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  annotate("text", x=0.03, y=-1.5, hjust=1, vjust=0,
           label = paste0("r=",r1)) +
  scale_size(range = c(1,30)) +
  guides(size="none") +
  ggtitle("What is the association between Unemployment and average academic achievement?") +
  xlab("Average Unemployment") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)")

# What is the association between poverty and average academic achievement?
r2 <- round(cor(dat$cs_mn_avg_ol, dat$povertyavgall),2)
ggplot(aes(x=povertyavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.3, pch=21, color = "black", fill = "darkolivegreen3", aes(size = totenrl)) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  annotate("text", x=0.05, y=-1.5, hjust=1, vjust=0,
           label = paste0("r=",r2)) +
  scale_size(range = c(1,30)) +
  guides(size="none") +
  ggtitle("What is the association between poverty and average academic achievement?") +
  xlab("Average Poverty") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)")

# What is the association between SES and average academic achievement?
r3 <- round(cor(dat$cs_mn_avg_ol, dat$sesavgall),2)
ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.3, pch=21, color = "black", fill = "blueviolet", aes(size = totenrl)) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  annotate("text", x=-3, y=0.75, hjust=1, vjust=0,
           label = paste0("r=",r3)) +
  scale_size(range = c(1,30)) +
  guides(size=FALSE) +
  ggtitle("What is the association between SES and average academic achievement?") +
  xlab("Average SES") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

Unemployment and Average Academic Achievement

First, we create boxplots to display the range of the median unemployment rates between all 50 states. Here, we observe that South Dakota has the lowest unemployment rate and Mississippi has one of the highest unemployment rates (excluding DC). Then, Montana seems to have similar unemployment rates compared to Massachusetts. We note this to plot four states that we can compare in terms of their unemployment-academic achievement gradient.

Next, we found that according to the Bureau of Labor Statistics, Nebraska has the lowest unemployment rate as of September 2021 at 2.0. Washington and Massachusetts have very similar unemployment rates of 4.9 and 5.2, respectively. Lastly, California has the highest unemployment rate of 7.5. Therefore, we also specifically look at the differences in the unemployment-achievement gradient between Nebraska, Washington, Massachusetts, and California. To start off, we plot the boxplots of unemployment rates for the four states and note the differences in their medians.

Then, in one of our final plots titled “How does the Unemployment-Achievement gradient differ in WA, MA, NE, and CA?”, we notice that CA has a huge range of unemployment rates with the line stretching wide into higher unemployment scores. Meanwhile, the line for Nebraska has a range that stretches into the low unemployment rates. We see that all four lines have negative slopes with higher unemployment rates with lower scores of academic achievement and lower unemployment rates with higher scores of academic achievement. However, when comparing between states, the two lines for CA and NE seem to have very similar slopes that overlap with one another. Massachusetts and Washington have similar ranges of unemployment rates but the line is overall at a higher level. Districts in MA display higher average academic achievement scores in standard deviation units (relative to the national mean) overall compared to CA, NE, and WA.

Finally, in the last plot labeled “How does the Unemployment-Achievement gradient differ in MS, MA, SD, and MT?”, we also plot the gradients for the four states of Massachusetts, Montana, South Dakota, and Mississippi specified earlier. Here, we notice that South Dakota has a wide range of unemployment rates that stretch into lower numbers. MT and SD seem to have similar slopes while MS has a slope that is closer to zero. The unemployment-achievement gradient slope seems to be highest in magnitude (most negative) for MA but again, we notice that the average academic achievement scores are higher overall at all unemployment rates compared to all other states.

We note that there are several flaws in solely using unemployment as a covariate. We believe that there could be additional variation happening that is not just coming from unemployment. For further analyses, we can look at additional covariates of interest that may impact academic achievement such as income earned, % of students receiving free lunch, poverty levels, environmental factors, housing, and social policies. We also ask why Massachusetts has significantly higher scores of academic achievement at every level of unemployment compared to several other states. Perhaps this can be due to the educational system and policies in MA and/or the different programs available to school districts in Massachusetts. Other variables that may impact the academic performance of students in each state include family environments (such as domestic violence and abuse), racism, and other economic and social measures.

# The unemployment distribution between all 50 states ordered by the median
p_unempdist <- dat %>% 
  mutate(stateabb = reorder(stateabb, unempavgall, FUN = median)) %>%
  ggplot(aes(stateabb, unempavgall)) +
  geom_boxplot(color = "darkorchid4") +
  #scale_y_discrete() +
  ylim(0, 0.2) +
  scale_x_discrete() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5, size = 7)) +
  ggtitle("Distribution of Unemployment rates for each state") +
  xlab("States") +
  ylab("Unemployment Rate")
p_unempdist
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

# The unemployment distribution between WA vs. MA vs. NE vs. CA
dat %>%
  filter(stateabb == "WA" | stateabb == "MA" | stateabb == "NE" | stateabb == "CA") %>% 
  ggplot() +
  geom_boxplot(aes(x= stateabb, y=unempavgall, fill=stateabb), color = "black", outlier.colour = "navy", outlier.size = 1, notch = FALSE) +
  scale_y_continuous(trans = "sqrt") +
  xlab("") + 
  ylab("Unemployment") +
  theme(legend.position ="none")

# How does the unemployment gradient compare in WA vs. MA vs. NE vs. CA?
dat_wa_ma_ne_ca <- filter(dat, stateabb %in% c("WA","MA", "NE", "CA"))

ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
  geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_wa_ma_ne_ca) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb), formula = y~x, 
              data = dat_wa_ma_ne_ca) +
  scale_size(range = c(1,30)) +
  guides(size="none", color="none") +
  labs(fill = "State") +
  ggtitle("Unemployment-achievement points in WA, MA, NE, and CA?") +
  xlab("Average Unemployment") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)")

ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_wa_ma_ne_ca, show.legend = FALSE) +
  #geom_smooth(se=F, lwd=0.5, lty=2, method="lm", formula = y~poly(x,3)) +
  scale_size(range = c(1,30)) +
  geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
              data = dat_wa_ma_ne_ca) +
  ggtitle("How does the Unemployment-Achievement gradient differ in WA, MA, NE, and CA?") +
  xlab("Average Unemployment") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)") +
  labs(color = "State")

# How does the unemployment gradient compare in MS vs. MA vs. SD vs. MT?
dat_ms_ma_sd_mt <- filter(dat, stateabb %in% c("MS","MA", "SD", "MT"))

  ggplot(aes(x=unempavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha=0.04, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_ms_ma_sd_mt, show.legend = FALSE) +
  scale_size(range = c(1,30)) +
  geom_smooth(se=F, lwd=1, lty=1, method="lm", aes(color=stateabb), formula = y~x,
              data = dat_ms_ma_sd_mt) +
  ggtitle("How does the Unemployment-Achievement gradient differ in MS, MA, SD, and MT?") +
  xlab("Average Unemployment") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)") +
  labs(color = "State")

Is this relationship specific to unemployment and academic achievement?

Furthermore, we chose to plot the academic achievement by SES composite scores for MA and WA to observe whether we will observe a difference in the relationship between SES and academic achievement and the relationship between unemployment and academic achievement. We see that there are some differences in the range and the gradient. By simply looking at the points, we notice that Massachusetts and Washington generally overlap in their relationships but MA does seem to have slightly higher levels of SES.

Massachusetts and Washington may have similar unemployment rates but may look different on other scales such as housing or poverty or income levels. Therefore, we want to consider other covariates, such as % of free lunch, an indicator that our previous regression tree included as a predictor. We saw that the random forest and decision tree model mentioned earlier concludes that percent of students receiving free lunch has high priority in influencing academic achievement.

# How does the SES gradient compare in WA vs. MA?
dat_wa_ma <- filter(dat, stateabb %in% c("WA","MA"))

ggplot(aes(x=sesavgall, y=cs_mn_avg_ol), data = dat) +
  geom_point(alpha = 0.2, pch=21, color = "black", fill = "grey", aes(size = totenrl)) +
  geom_point(alpha=0.7, pch=21, color="black", aes(size=totenrl, fill=stateabb),
             data = dat_wa_ma) +
  geom_smooth(se=F, lwd=0.5, lty=2, method="lm", color="black", aes(fill=stateabb),
              data = dat_wa_ma) +
  scale_size(range = c(1,30)) +
  guides(size=FALSE, color=FALSE) +
  labs(fill = "State") +
  ggtitle("How does the SES-achievement gradient differ in WA and MA?") +
  xlab("Average SES") +
  ylab("Average Test Scores (Grade 5.5, CS Scale)")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## `geom_smooth()` using formula 'y ~ x'

Next we further explore the relationship between academic achievement and socioecnomic factors by looking at the percent of students receiving free lunch_____ please continue on from here! hwaiting ye eun!